
Requesting a .gz file then opening as a dataframe

cymx2012 (registered member)
2023-01-25 14:38

import requests
import pandas as pd
from io import BytesIO

url = 'https://github.com/apache-superset/examples-data/blob/master/san_francisco.csv.gz'

# GitHub serves an HTML page at the `blob` URL; adding `?raw=true`
# makes it return the raw binary file instead.
query = {'raw': 'true'}

# Send a GET request; `requests` appends the query parameters to the URL.
response = requests.get(url, params=query)

# Stop early on an HTTP error (e.g. 404) instead of parsing an error page.
response.raise_for_status()

# Wrap the response body in `BytesIO` to get an in-memory, file-like
# object (a pseudo-file) that `pandas.read_csv` can read from.
# `response.content` is binary data (bytes), hence `BytesIO`. If the
# response were plain text, we would use `StringIO` instead.
fp = BytesIO(response.content)

# Finally, parse the content into a DataFrame 
# (populate other parameters as needed).
df = pd.read_csv(fp, compression='gzip')
print(df)

This should return the contents of the csv.gz file as a DataFrame. Using the URL in this example yields the following output:

               LON        LAT NUMBER            STREET UNIT  CITY  DISTRICT  REGION  POSTCODE  ID
0      -122.391267  37.769093   1550       04th Street  NaN   NaN       NaN     NaN     94158 NaN
1      -122.390850  37.769426   1505       04th Street  NaN   NaN       NaN     NaN     94158 NaN
2      -122.428577  37.780627   1160   Buchanan Street  NaN   NaN       NaN     NaN     94115 NaN
3      -122.428534  37.780385   1142   Buchanan Street  NaN   NaN       NaN     NaN     94115 NaN
4      -122.428525  37.780317   1140   Buchanan Street  NaN   NaN       NaN     NaN     94115 NaN
...            ...        ...    ...               ...  ...   ...       ...     ...       ...  ..
261547 -122.418380  37.808349    360  Jefferson Street  NaN   NaN       NaN     NaN     94133 NaN
261548 -122.418380  37.808349    350  Jefferson Street  NaN   NaN       NaN     NaN     94133 NaN
261549 -122.417829  37.807479    333  Jefferson Street  NaN   NaN       NaN     NaN     94133 NaN
261550 -122.418916  37.809044   1965      Al Scoma Way  NaN   NaN       NaN     NaN     94133 NaN
261551 -122.444322  37.749124    350    Glenview Drive  NaN   NaN       NaN     NaN     94131 NaN
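
To see what the `BytesIO` step does without touching the network, here is a minimal sketch that uses only the standard library. The `raw` bytes are made-up sample data standing in for `response.content`, and `gzip.open` plus `csv.DictReader` are swapped in for `pd.read_csv(..., compression='gzip')` purely for illustration; this is not how pandas does it internally.

```python
import csv
import gzip
import io

# Hypothetical sample data standing in for `response.content`.
raw = gzip.compress(b"LON,LAT\n-122.391267,37.769093\n")

# Bytes held in memory, exposed through a file-like interface.
fp = io.BytesIO(raw)

# Decompress and decode the stream as text, then parse the CSV rows.
with gzip.open(fp, mode="rt", encoding="utf-8", newline="") as text_stream:
    rows = list(csv.DictReader(text_stream))

print(rows)
```

The same `fp` object is what gets handed to `pd.read_csv` in the snippet above; pandas only needs something it can call `.read()` on.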

I used a sample csv.gz file I found online because the URL and parameters you provided return a JPEG image instead, which is a bit puzzling. In any case, adapt this snippet to your case and it should produce the desired result.
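
If you are unsure what a URL actually returns, one option is to check the first few bytes of `response.content` before handing it to pandas. The sketch below is a rough check, not a full file-type detector: gzip streams start with the bytes `1f 8b` and JPEG files with `ff d8 ff`. The helper name `sniff_format` is hypothetical, and the sample bytes are made up so the example runs offline.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of every gzip stream
JPEG_MAGIC = b"\xff\xd8\xff"  # first three bytes of a JPEG file


def sniff_format(data: bytes) -> str:
    """Guess the payload type from its leading bytes (hypothetical helper)."""
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


# Made-up payloads standing in for `response.content`.
gz_payload = gzip.compress(b"LON,LAT\n-122.39,37.77\n")
jpeg_payload = b"\xff\xd8\xff\xe0" + b"\x00" * 16

print(sniff_format(gz_payload))
print(sniff_format(jpeg_payload))
```

In your own script you would call `sniff_format(response.content)` and only run `pd.read_csv(fp, compression='gzip')` when it reports `gzip`.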
