Python bs4.dammit.EncodingDetector() Examples
The following are 22
code examples of bs4.dammit.EncodingDetector().
You can vote up the ones you like or vote down the ones you don't like,
and go to the original project or source file by following the links above each example.
You may also want to check out all available functions/classes of the module
bs4.dammit
, or try the search function
.
Example #1
Source File: _lxml.py From ServerlessCrawler-VancouverRealState with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #2
Source File: _lxml.py From moviegrabber with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #3
Source File: _lxml.py From Tautulli with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, str): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, str): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #4
Source File: _lxml.py From bazarr with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, str): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, str): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #5
Source File: _lxml.py From bazarr with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #6
Source File: _lxml.py From MIA-Dictionary-Addon with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, str): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, str): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #7
Source File: _lxml.py From POC-EXP with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #8
Source File: _lxml.py From ru with GNU General Public License v2.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #9
Source File: _lxml.py From MARA_Framework with GNU Lesser General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #10
Source File: _lxml.py From CrisisMappingToolkit with Apache License 2.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #11
Source File: _lxml.py From nbaplus-server with Apache License 2.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #12
Source File: _lxml.py From stopstalk-deployment with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #13
Source File: _lxml.py From nzb-subliminal with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #14
Source File: _lxml.py From B.E.N.J.I. with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, str): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, str): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #15
Source File: _lxml.py From svg-animation-tools with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #16
Source File: _lxml.py From svg-animation-tools with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #17
Source File: _lxml.py From fuzzdb-collect with GNU General Public License v3.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #18
Source File: _lxml.py From locality-sensitive-hashing with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #19
Source File: _lxml.py From pledgeservice with Apache License 2.0 | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector(markup, try_encodings, is_html) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #20
Source File: _lxml.py From ServerlessCrawler-VancouverRealState with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #21
Source File: _lxml.py From ServerlessCrawler-VancouverRealState with MIT License | 5 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """ :yield: A series of 4-tuples. (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for parsing the document. """ # Instead of using UnicodeDammit to convert the bytestring to # Unicode using different encodings, use EncodingDetector to # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, unicode): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, unicode): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)
Example #22
Source File: _lxml.py From V1EngineeringInc-Docs with Creative Commons Attribution Share Alike 4.0 International | 4 votes |
def prepare_markup(self, markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None): """Run any preliminary steps necessary to make incoming markup acceptable to the parser. lxml really wants to get a bytestring and convert it to Unicode itself. So instead of using UnicodeDammit to convert the bytestring to Unicode using different encodings, this implementation uses EncodingDetector to iterate over the encodings, and tell lxml to try to parse the document as each one in turn. :param markup: Some markup -- hopefully a bytestring. :param user_specified_encoding: The user asked to try this encoding. :param document_declared_encoding: The markup itself claims to be in this encoding. :param exclude_encodings: The user asked _not_ to try any of these encodings. :yield: A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement) Each 4-tuple represents a strategy for converting the document to Unicode and parsing it. Each strategy will be tried in turn. """ is_html = not self.is_xml if is_html: self.processing_instruction_class = ProcessingInstruction else: self.processing_instruction_class = XMLProcessingInstruction if isinstance(markup, str): # We were given Unicode. Maybe lxml can parse Unicode on # this system? yield markup, None, document_declared_encoding, False if isinstance(markup, str): # No, apparently not. Convert the Unicode to UTF-8 and # tell lxml to parse it as UTF-8. yield (markup.encode("utf8"), "utf8", document_declared_encoding, False) try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) for encoding in detector.encodings: yield (detector.markup, encoding, document_declared_encoding, False)